AITopics | self-supervised vision transformer

Collaborating Authors

self-supervised vision transformer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MST: MaskedSelf-SupervisedTransformerfor VisualRepresentation

Neural Information Processing SystemsFeb-9-2026, 06:42:38 GMT

Based oninstance discrimination [14, 4, 6, 5, 13, 1], some methods show the effectiveness in the image classification task.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia > China (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.35)

Add feedback

Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency

Neural Information Processing SystemsDec-24-2025, 19:52:21 GMT

Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised ImageNet representations. In this work, we shift focus to adapting modern architectures for object recognition -- the increasingly popular Vision Transformer (ViT) -- initialized with modern pretraining based on self-supervised learning (SSL). Inspired by the design of recent SSL approaches based on learning from partial image inputs generated via masking or cropping -- either by learning to predict the missing pixels, or learning representational invariances to such augmentations -- we propose PACMAC, a two-stage adaptation algorithm for self-supervised ViTs. PACMAC first performs in-domain SSL on pooled source and target data to learn task-discriminative features, and then probes the model's predictive consistency across a set of partial target inputs generated via a novel attention-conditioned masking strategy, to identify reliable candidates for self-training. Our simple approach leads to consistent performance gains over competing methods that use ViTs and self-supervised initializations on standard object recognition benchmarks.

attention-conditioned masking consistency, name change, self-supervised vision transformer, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.97)

Add feedback

CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning

Luong, Justin, Xue, Hao, Salim, Flora D.

arXiv.org Artificial IntelligenceAug-7-2025

Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient's airways. In recent years, AI-based diagnostic systems operating on respiratory sounds, have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against other pre-training strategies on three diagnostically important cough classification tasks. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations in enhancing performance on downstream tasks.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2508.03764

Country:

North America (0.28)
Oceania > Australia > New South Wales (0.14)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency

Neural Information Processing SystemsJan-17-2025, 21:34:18 GMT

attention-conditioned masking consistency, pacmac, self-supervised vision transformer

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Self-Supervised Vision Transformers for Writer Retrieval

Raven, Tim, Matei, Arthur, Fink, Gernot A.

arXiv.org Artificial IntelligenceSep-1-2024

While methods based on Vision Transformers (ViT) have achieved state-of-the-art performance in many domains, they have not yet been applied successfully in the domain of writer retrieval. The field is dominated by methods using handcrafted features or features extracted from Convolutional Neural Networks. In this work, we bridge this gap and present a novel method that extracts features from a ViT and aggregates them using VLAD encoding. The model is trained in a self-supervised fashion without any need for labels. We show that extracting local foreground features is superior to using the ViT's class token in the context of writer retrieval. We evaluate our method on two historical document collections. We set a new state-at-of-art performance on the Historical-WI dataset (83.1\% mAP), and the HisIR19 dataset (95.0\% mAP). Additionally, we demonstrate that our ViT feature extractor can be directly applied to modern datasets such as the CVL database (98.6\% mAP) without any fine-tuning.

dataset, international conference, vision transformer, (14 more...)

arXiv.org Artificial Intelligence

2409.00751

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.04)
Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Transfer Learning with Self-Supervised Vision Transformers for Snake Identification

Miyaguchi, Anthony, Gustineli, Murilo, Fischer, Austin, Lundqvist, Ryan

arXiv.org Artificial IntelligenceJul-8-2024

We present our approach for the SnakeCLEF 2024 competition to predict snake species from images. We explore and use Meta's DINOv2 vision transformer model for feature extraction to tackle species' high variability and visual similarity in a dataset of 182,261 images. We perform exploratory analysis on embeddings to understand their structure, and train a linear classifier on the embeddings to predict species. Despite achieving a score of 39.69, our results show promise for DINOv2 embeddings in snake identification.

dataset, prediction, snake, (13 more...)

arXiv.org Artificial Intelligence

2407.06178

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
Africa > Sub-Saharan Africa (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization

Doerrich, Sebastian, Di Salvo, Francesco, Ledig, Christian

arXiv.org Artificial IntelligenceJul-3-2024

Despite notable advancements, the integration of deep learning (DL) techniques into impactful clinical applications, particularly in the realm of digital histopathology, has been hindered by challenges associated with achieving robust generalization across diverse imaging domains and characteristics. Traditional mitigation strategies in this field such as data augmentation and stain color normalization have proven insufficient in addressing this limitation, necessitating the exploration of alternative methodologies. To this end, we propose a novel generative method for domain generalization in histopathology images. Our method employs a generative, self-supervised Vision Transformer to dynamically extract characteristics of image patches and seamlessly infuse them into the original images, thereby creating novel, synthetic images with diverse attributes. By enriching the dataset with such synthesized images, we aim to enhance its holistic nature, facilitating improved generalization of DL models to unseen domains. Extensive experiments conducted on two distinct histopathology datasets demonstrate the effectiveness of our proposed approach, outperforming the state of the art substantially, on the Camelyon17-wilds challenge dataset (+2%) and on a second epithelium-stroma dataset (+26%). Furthermore, we emphasize our method's ability to readily scale with increasingly available unlabeled data samples and more complex, higher parametric architectures.

dataset, domain generalization, synthetic image, (11 more...)

arXiv.org Artificial Intelligence

2407.029

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Germany > Bavaria (0.04)

Genre: Research Report > New Finding (0.47)

Industry: Health & Medicine > Diagnostic Medicine (0.47)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Visualizing the loss landscape of Self-supervised Vision Transformer

Lee, Youngwan, Willette, Jeffrey Ryan, Kim, Jonghee, Hwang, Sung Ju

arXiv.org Artificial IntelligenceMay-28-2024

The Masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. However, even though MAE shows better generalization capability than fully supervised training from scratch, the reason why has not been explored. In another line of work, the Reconstruction Consistent Masked Auto Encoder (RC-MAE), has been proposed which adopts a self-distillation scheme in the form of an exponential moving average (EMA) teacher into MAE, and it has been shown that the EMA-teacher performs a conditional gradient correction during optimization. To further investigate the reason for better generalization of the self-supervised ViT when trained by MAE (MAE-ViT) and the effect of the gradient correction of RC-MAE from the perspective of optimization, we visualize the loss landscapes of the self-supervised vision transformer by both MAE and RC-MAE and compare them with the supervised ViT (Sup-ViT). Unlike previous loss landscape visualizations of neural networks based on classification task loss, we visualize the loss landscape of ViT by computing pre-training task loss. Through the lens of loss landscapes, we find two interesting observations: (1) MAE-ViT has a smoother and wider overall loss curvature than Sup-ViT. (2) The EMA-teacher allows MAE to widen the region of convexity in both pretraining and linear probing, leading to quicker convergence. To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape.

loss landscape, transformer, vision transformer, (13 more...)

arXiv.org Artificial Intelligence

2405.18042

Country: Asia > South Korea (0.05)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

SSVT: Self-Supervised Vision Transformer For Eye Disease Diagnosis Based On Fundus Images

Wang, Jiaqi, Kang, Mengtian, Liu, Yong, Zhang, Chi, Liu, Ying, Li, Shiming, Qi, Yue, Xu, Wenjun, Tang, Chenyu, Occhipinti, Edoardo, Yusufu, Mayinuer, Wang, Ningli, Bai, Weiling, Gao, Shuo, Occhipinti, Luigi G.

arXiv.org Artificial IntelligenceApr-20-2024

Machine learning-based fundus image diagnosis technologies trigger worldwide interest owing to their benefits such as reducing medical resource power and providing objective evaluation results. However, current methods are commonly based on supervised methods, bringing in a heavy workload to biomedical staff and hence suffering in expanding effective databases. To address this issue, in this article, we established a label-free method, name 'SSVT',which can automatically analyze un-labeled fundus images and generate high evaluation accuracy of 97.0% of four main eye diseases based on six public datasets and two datasets collected by Beijing Tongren Hospital. The promising results showcased the effectiveness of the proposed unsupervised learning method, and the strong application potential in biomedical resource shortage regions to improve global eye health.

dataset, diagnosis, ssvt, (10 more...)

arXiv.org Artificial Intelligence

2404.13386

Country:

Asia > China > Beijing > Beijing (0.27)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Improving Visual Prompt Tuning for Self-supervised Vision Transformers

Yoo, Seungryong, Kim, Eunji, Jung, Dahuin, Lee, Jungbeom, Yoon, Sungroh

arXiv.org Artificial IntelligenceJun-8-2023

Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks. It leverages extra learnable tokens, known as prompts, which steer the frozen pretrained ViTs. Although VPT has demonstrated its applicability with supervised vision transformers, it often underperforms with self-supervised ones. Through empirical observations, we deduce that the effectiveness of VPT hinges largely on the ViT blocks with which the prompt tokens interact. Specifically, VPT shows improved performance on image classification tasks for MAE and MoCo v3 when the prompt tokens are inserted into later blocks rather than the first block. These observations suggest that there exists an optimal location of blocks for the insertion of prompt tokens. Unfortunately, identifying the optimal blocks for prompts within each self-supervised ViT for diverse future scenarios is a costly process. To mitigate this problem, we propose a simple yet effective method that learns a gate for each ViT block to adjust its intervention into the prompt tokens. With our method, prompt tokens are selectively influenced by blocks that require steering for task adaptation. Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation. The code is available at https://github.com/ryongithub/GatedPromptTuning.

artificial intelligence, machine learning, prompt token, (17 more...)

arXiv.org Artificial Intelligence

2306.05067

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback